Preprocessing and clustering scATAC PBMCs using Scanpy and scOpen

In this tutorial, we show how to use scOpen and (epi)Scanpy to analyze scATAC-seq data. The data come from the epiScanpy tutorial and consist of ~3000 human PBMCs. Cell labels are available for validation.

Load the data

Read the count matrix into an AnnData object, which holds many slots for annotations and different representations of the data. AnnData also comes with its own HDF5-based file format: .h5ad.

Extract the FACS information from the file names

Load the additional metadata

Load gene/transcript annotation

Download annotation file (the data are aligned on hg19)

Preprocessing

Check if the data matrix is binary - if not, binarize the data matrix
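The check described above can be sketched with plain NumPy. This is only an illustration on a made-up dense toy matrix; the helper names `is_binary` and `binarize` are hypothetical and are not the functions used in the original notebook.

```python
import numpy as np

# Toy count matrix (cells x peaks); the values are made up for illustration.
counts = np.array([
    [0, 2, 1, 0],
    [3, 0, 0, 1],
    [0, 0, 5, 0],
])


def is_binary(matrix):
    """Return True if every entry of the matrix is 0 or 1."""
    return bool(np.isin(matrix, (0, 1)).all())


def binarize(matrix):
    """Set every non-zero entry to 1 and leave zeros untouched."""
    return (matrix > 0).astype(np.int8)


# Binarize only when the matrix is not already binary.
if not is_binary(counts):
    counts = binarize(counts)

print(counts)
```

After this step, each entry only records whether a peak is open (1) or closed (0) in a cell.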

Quality controls

Filter the cells and peaks based on the QC plots
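As a rough sketch of what this filtering does, the snippet below removes cells with too few open peaks and then peaks that are open in too few cells. The toy matrix and the thresholds `min_features` and `min_cells` are hypothetical; in practice the thresholds would be read off the QC plots.

```python
import numpy as np

# Toy binary matrix (5 cells x 4 peaks); values are made up for illustration.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 1],
])

min_features = 2  # hypothetical: minimum number of open peaks per cell
min_cells = 2     # hypothetical: minimum number of cells in which a peak is open

# Step 1: drop cells with too few open peaks.
features_per_cell = X.sum(axis=1)
keep_cells = features_per_cell >= min_features
X_filtered = X[keep_cells]

# Step 2: drop peaks that are open in too few of the remaining cells.
cells_per_peak = X_filtered.sum(axis=0)
keep_peaks = cells_per_peak >= min_cells
X_filtered = X_filtered[:, keep_peaks]

print(X_filtered.shape)
```

Filtering cells before peaks means that the per-peak counts are computed on the retained cells only; the opposite order is also possible and can give a slightly different result.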

Looking at the QC plots after filtering

Identifying the most variable features

We aim to select a cutoff after the elbow.
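The idea of ranking features and cutting after the elbow can be sketched as follows. This sketch assumes that variability is scored as the variance of each binary peak, p(1 - p), which is not necessarily the score used in the original notebook; the simulated data and the cutoff `n_keep` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binary matrix: 200 cells x 50 peaks, each peak with its own
# probability of being open (made-up data for illustration).
open_prob = rng.uniform(0.0, 1.0, size=50)
X = (rng.uniform(size=(200, 50)) < open_prob).astype(np.int8)

# Assumed variability score: variance of a binary feature, p * (1 - p).
frac_open = X.mean(axis=0)
variability = frac_open * (1.0 - frac_open)

# Rank peaks from most to least variable; plotting `ranked` against its
# rank gives the curve on which the elbow is located.
order = np.argsort(variability)[::-1]
ranked = variability[order]

n_keep = 20  # hypothetical cutoff chosen after the elbow
selected = np.sort(order[:n_keep])
X_var = X[:, selected]

print(X_var.shape)
```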

We next use scOpen for dimensionality reduction

Cell clustering

1. Louvain clustering

2. k-means clustering

3. Hierarchical clustering

4. Leiden clustering

5. Comparison and Adjusted Rand Index (ARI)

A few metrics to compare the quality of the different clusterings.

For all of these metrics, the best possible value is 1.

1) Compute the Adjusted Rand Index (ARI) for the different clusterings to determine which one performs best. The ARI measures the similarity between two clusterings (predicted and true labels) by counting pairs of cells that are assigned to the same or to different clusters in the two clusterings, corrected for chance.
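To make the pair counting concrete, here is a minimal pure-Python sketch of the ARI. The function name `adjusted_rand_index` and the toy labels are illustrative only; the original notebook most likely relies on a library implementation instead.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("both labelings must cover the same cells")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two cells: labelings agree trivially

    # Pairs of cells placed together in both, in the true, in the predicted.
    joint = Counter(zip(labels_true, labels_pred))
    same_both = sum(comb(c, 2) for c in joint.values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    # Agreement expected by chance, and the maximum achievable agreement.
    expected = same_true * same_pred / total_pairs
    max_index = 0.5 * (same_true + same_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (same_both - expected) / (max_index - expected)


# Made-up labels for six cells.
truth = ["B", "B", "T", "T", "NK", "NK"]
perfect = [0, 0, 1, 1, 2, 2]   # same partition, different label names
merged = [0, 0, 1, 1, 1, 1]    # T and NK cells merged into one cluster

print(adjusted_rand_index(truth, perfect))
print(adjusted_rand_index(truth, merged))
```

The score depends only on how cells are grouped, not on the cluster names, so `perfect` is scored as an exact match even though its labels are integers.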

2) Compute the homogeneity score. The score is higher when each cluster contains only cells with the same ground-truth label.
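The homogeneity score can be written as 1 - H(C|K) / H(C), where C are the ground-truth labels and K the clusters. The sketch below implements this definition in pure Python; the function name `homogeneity` and the toy labels are illustrative, not taken from the original notebook.

```python
from collections import Counter
from math import log


def homogeneity(labels_true, labels_pred):
    """Homogeneity: 1 - H(true labels | clusters) / H(true labels)."""
    n = len(labels_true)
    class_counts = Counter(labels_true)
    cluster_counts = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Entropy of the ground-truth labels.
    h_class = -sum(c / n * log(c / n) for c in class_counts.values())
    if h_class == 0:
        return 1.0  # a single ground-truth label: every cluster is pure

    # Conditional entropy of the ground-truth labels given the clusters.
    h_cond = -sum(
        c / n * log(c / cluster_counts[k]) for (_, k), c in joint.items()
    )
    return 1.0 - h_cond / h_class


# Made-up labels for four cells.
truth = ["B", "B", "T", "T"]
over_split = [0, 1, 2, 3]   # every cluster is pure
one_cluster = [0, 0, 0, 0]  # all cells lumped together

print(homogeneity(truth, over_split))
print(homogeneity(truth, one_cluster))
```

Note that over-splitting is not penalized: a clustering in which every cell is its own cluster is still perfectly homogeneous, which is one reason to report this score alongside the other metrics.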

3) Compute the Adjusted Mutual Information (AMI), a measure of the similarity between two labelings of the same data that accounts for chance (the raw mutual information is generally higher for two sets of labels with a large number of clusters).
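The remark in parentheses can be illustrated with a small simulation. The sketch below computes the raw (unadjusted) mutual information between two independent random labelings, once with few and once with many clusters. It does not implement the chance correction itself; the helper names, the seed, and the cluster counts are arbitrary choices for illustration.

```python
import random
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Raw mutual information (in nats) between two labelings."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    return sum(
        (c / n) * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )


def random_labels(n_cells, n_clusters, rng):
    """Assign each cell to a uniformly random cluster."""
    return [rng.randrange(n_clusters) for _ in range(n_cells)]


rng = random.Random(0)
n_cells = 300

# Two pairs of independent labelings: they share no real structure.
mi_few = mutual_information(
    random_labels(n_cells, 2, rng), random_labels(n_cells, 2, rng)
)
mi_many = mutual_information(
    random_labels(n_cells, 20, rng), random_labels(n_cells, 20, rng)
)

print(f"random labelings, 2 clusters:  MI = {mi_few:.3f}")
print(f"random labelings, 20 clusters: MI = {mi_many:.3f}")
```

Although both pairs of labelings are unrelated, the raw score grows with the number of clusters. The adjusted variant subtracts the value expected by chance so that unrelated labelings score close to 0.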

Differential chromatin analysis